Abstract

In the era of the Internet, an enormous number of news articles is published
every minute of every day. Because of this volume, news retrieval systems
must process news articles frequently and intensively. The news retrieval
systems in use today cannot cope with such data-intensive computations.
Cloudpress 2.0, presented here, is designed and implemented to be scalable,
robust and fault tolerant. All the processes involved in news retrieval,
such as fetching, pre-processing, indexing, storing and summarizing, exploit
the MapReduce paradigm and the power of Cloud computing. The system uses
novel approaches for parallel processing, for storing the news articles in a
distributed database and for visualizing them in 3D. It uses Lucene-based
indexing for efficient and fast retrieval. It also includes a novel query
expansion feature for searching the news articles. Cloudpress 2.0 also allows
on-the-fly, extractive summarization of news articles based on the input
query.
1. Introduction

The main functions of a news retrieval system are fetching, processing and
retrieving news articles in any format, such as text, image, video, audio or
any combination of these. The news retrieval systems used today do not fully
exploit parallelism to increase performance and to decrease retrieval time.
Furthermore, they are not designed to be powered by Cloud technology or to
follow the MapReduce approach.

Generally, news articles contain few mistakes, as they are written by
professionals, and they are often pieces of larger stories that contain rich
information about the people and places involved. However, they need to be
fetched from the Internet, pre-processed, indexed and stored in a database.
Because the number of news articles is ever-growing, a distributed database
is often preferred for storage. The MapReduce framework can be used to split
each task into sub-tasks and assign them to different worker nodes in the
Cloud for parallel processing. This can greatly reduce the processing time.

Section 2 gives a brief literature survey of news retrieval systems,
parallel crawlers, news visualization and the MapReduce programming model.
Section 3 introduces the overall architecture of the news retrieval system
presented in this paper. Section 4 explains the implementation details of
the sub-components that make up the system. Section 5 deals with the
performance evaluation of the system and, finally, Section 6 presents
concluding remarks and potential future enhancements.

2. Literature Survey 

2.1. News Retrieval System 

A novel design of a news retrieval tool is introduced in [1]. It makes use
of an existing database of some newspapers such as Times. Two algorithms are
presented, namely a conflation algorithm and a relevance feedback algorithm.
The conflation algorithm strips the suffixes off words, leaving a root stem.
The rules used in this algorithm are very simple, but the tool performs as
well as many of the more complex algorithms that have been developed. The
relevance feedback algorithm lets the user add news articles that the user
considers relevant to the input query, and the system analyses them to
extract keywords. This algorithm relieves the user of the burden of having
to think up new words for the input query.

A news filtering and summarization system presented in prior work can
automatically recognize Web news pages, retrieve each news page's title and
news content and extract key phrases. The key phrase extraction described
there performs better than methods based on term frequency and lexical
chains.

2.2. Parallel Crawler 

A scalable, extensible Web crawler named Mercator, written entirely in
Java, is discussed in the literature. That work enumerates the major
components of any scalable Web crawler, comments on alternatives and
tradeoffs in their design, and describes the particular components used in
Mercator. It also describes Mercator's support for extensibility and
customizability.

A scalable, distributed crawler named UbiCrawler has also been introduced.
It is characterized by platform independence, linear scalability and
graceful degradation in the presence of faults, and it has an assignment
function that partitions the domain to crawl. The limitations of handling
large sets of data, and techniques to overcome them, have also been
explained in the literature.

2.3. News Visualization 

A tree map visualization has been used for visualizing news articles in
prior work, which also deals with some interactivity and abstraction
issues. It focuses on the automatic generation of a hierarchical knowledge
map called the NewsMap, based on online Chinese news, particularly the
finance and health sections. The hierarchical knowledge map may be used as a
tool for browsing business intelligence and medical knowledge hidden in news
articles. NewsMap employs an improved interface combining a 1D alphabetical
hierarchical list and a 2D Self-Organizing Map (SOM) island display.

2.4. MapReduce Paradigm

The MapReduce paradigm has been described as a programming model and an
associated implementation for processing and generating large datasets that
can be applied to various real-world tasks. Users specify the computation in
terms of a map and a reduce function, and the underlying runtime system
automatically parallelizes the computation across large-scale clusters of
machines, handles machine failures and schedules inter-machine communication
to make efficient use of the network and disks.
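
As a minimal illustration of the map and reduce functions described above,
the following single-process Python sketch counts words across documents.
The function names and the in-memory `shuffle` step are assumptions made for
illustration only; a real runtime such as Hadoop distributes these steps
across machines and handles failures.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group emitted values by key, as a runtime does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum the partial counts for one word."""
    return word, sum(counts)


def run_job(documents):
    """Run map, shuffle and reduce sequentially over {doc_id: text}."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_words(doc_id, text))
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())
```

Only `map_words` and `reduce_counts` carry task-specific logic; the rest is
the part that a MapReduce runtime would normally provide.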
3. Overall Architecture



Every data-intensive process in the news retrieval system presented here
takes a MapReduce approach, as shown in Figure 1.



Figure 1: MapReduce processing approach (diagram labels: Eucalyptus cloud
with Hadoop; Node 1)

The overall architecture in Figure 2 shows the two functions of the news
retrieval system: News Dataset Generation, and News Retrieval and
Visualization. News Dataset Generation involves crawling, pre-processing,
indexing and storing the news articles in a parallel fashion. News Retrieval
and Visualization involves query processing, query expansion, ranking,
summarizing and retrieving news articles from the distributed database and,
finally, visualizing them in 3D.

4. Implementation 

All the processes in the news retrieval system presented in this paper are
designed and implemented as map and reduce functions that run on the Cloud
using the Hadoop framework.
Figure 2: Architecture of Cloudpress 2.0



4.1. Parallel News Crawler

The parallel news crawler lets various worker nodes in the Cloud fetch news
articles from the Internet simultaneously. It is implemented using Java and
the Hadoop framework. The list of input URLs, which resides in the Hadoop
Distributed File System (HDFS), is split among the worker nodes in the
Cloud; each node crawls independently and stores the news articles in the
distributed database, as shown in Figure 3. If crawling a particular URL
takes a long time, then instead of waiting for that node to finish, a new
Hadoop worker node can be instantiated and the next URL assigned to it. This
keeps the crawling speed high compared to other crawlers in use today.

Nodes can thus be instantiated or terminated based on the workload, thereby
exploiting the full power of Cloud technology. The steps involved in the
crawling done by each worker node are shown in Figure 4.
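
The URL distribution described in this section can be sketched as follows.
This is a simplified Python model, not the Java/Hadoop implementation: the
round-robin split and the helper names `split_urls` and `extra_nodes_needed`
are assumptions chosen for illustration, and no real crawling or node
provisioning takes place.

```python
def split_urls(urls, num_workers):
    """Deal URLs round-robin so that each worker gets a near-equal share."""
    return [urls[i::num_workers] for i in range(num_workers)]


def extra_nodes_needed(crawl_seconds, time_limit):
    """Count URLs whose crawl exceeds the time limit.

    In this simplified model, each such URL triggers one additional
    worker node so that the next URL does not have to wait.
    """
    return sum(1 for seconds in crawl_seconds if seconds > time_limit)
```

For example, `split_urls(feed_links, 8)` would produce the per-node work
lists for an eight-node setup, where `feed_links` is a hypothetical list of
URLs read from the input file.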
Figure 3: Working of the Parallel News Crawler

Figure 4: Steps involved in parallel news crawling in each node



4.2. Parallel News Pre-Processor

Pre-processing starts by reading the input from the distributed database.
The first step is tokenization of the whole news article, in which all
punctuation marks and symbols are removed by scanning every character of the
news article. The words (sequences of characters separated by spaces) are
then counted to get the total word count. Next, stop words such as
prepositions, articles and some commonly occurring words are removed by
comparing the tokenized words of the news article with a standard stop word
list. This stop word list is kept centrally in HDFS, since it is needed by
all the nodes. Finally, the words in the news article are stemmed using
WordNet, which converts them to their root form without any prefix or suffix
characters. Words that are not found in WordNet, such as names of persons,
places and numbers, are left unchanged. Stemming completes the
pre-processing stage. The pre-processed list of words is finally stored in
the distributed database as Comma Separated Values (CSV).
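
The pre-processing steps above can be sketched in a few lines of Python.
This is an illustrative sketch only: the tiny `STOP_WORDS` set and the
`ROOT_FORMS` table are hypothetical stand-ins for the standard stop word
list and the WordNet lookup, and lowercasing the tokens is an added
assumption that the text does not state.

```python
import string

# Hypothetical, deliberately tiny stand-ins for the real resources.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is"}
ROOT_FORMS = {"leaking": "leak", "ships": "ship"}  # stands in for WordNet


def preprocess(article):
    """Return (total word count, comma-separated pre-processed words)."""
    # Tokenization: turn punctuation and symbols into spaces, then split.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = article.translate(table).lower().split()
    word_count = len(tokens)
    # Stop word removal against the shared list.
    kept = [token for token in tokens if token not in STOP_WORDS]
    # Stemming: words missing from the lookup are left unchanged.
    stemmed = [ROOT_FORMS.get(token, token) for token in kept]
    return word_count, ",".join(stemmed)
```

Words absent from the lookup table, such as names, pass through unchanged,
which mirrors the behaviour described for words not found in WordNet.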

4.3. Parallel News Indexer

The parallel news indexer is implemented in Java and creates two
Lucene-based indexes. The pre-processed CSVs are fetched from the
distributed database and supplied as input to generate the first index,
which is used for faster retrieval of news articles when the input query
requires exact matching of query terms. The second index is created from the
original news articles (as they were before pre-processing). This index is
created with term vectors and positional offsets and is used for proximity
search queries, that is, fetching news articles in which two or more
keywords occur within a particular distance of each other. For example, for
the query "war iraq"~2, the resulting news articles would have both "war"
and "iraq" within two words of each other. The indexer also calculates and
stores the Inverse Document Frequency (IDF) of every term in the index. The
IDF of each word $w_i$ is given by (1),

$$\mathrm{IDF}_i = \log \frac{N}{n_i} \qquad (1)$$

where $N$ is the total number of articles in the news database and $n_i$ is
the number of news articles that contain the word $w_i$.
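
Equation (1) can be computed as in the following Python sketch. The data
layout (a dictionary from article ID to a list of terms) and the use of the
natural logarithm are assumptions; the text does not state the base of the
logarithm.

```python
import math


def inverse_document_frequencies(articles):
    """Map each term to log(N / n_i).

    articles: dict of article id -> list of terms in that article.
    N is the number of articles; n_i is the number of articles
    containing the term at least once.
    """
    n_total = len(articles)
    doc_freq = {}
    for terms in articles.values():
        for term in set(terms):  # count each article once per term
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_total / n_i) for term, n_i in doc_freq.items()}
```

A term that occurs in every article gets an IDF of zero, so very common
words contribute nothing to the ranking score.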

4.4. Query Processor, Expander, Ranker and Fetcher

The basic steps involved in query processing, expansion, ranking and
fetching of news articles are shown in Figure 5. The input query can ask
either for a summary or for all relevant news articles. If it asks for a
summary, the query expansion feature is automatically disabled, so as to
obtain a relevant summary rather than a far-fetched one. First, the input
query from the user is tokenized and punctuation is removed, if necessary.
The user can choose how the input query is processed: AND/OR/NOT
evaluation, wildcard character matching or exact query term matching.

If the user opts for AND/OR/NOT evaluation, each tokenized query term is
checked to see whether it matches 'AND', 'OR' or 'NOT'. If it does not, the
query term can be used directly for retrieval by exact match. If it matches
one of these operators, the corresponding output is obtained as follows:

• If it matches 'AND', the document IDs of the words occurring before and
after 'AND' are retrieved from the inverted index and intersected to
give the final output.

• If it matches 'OR', the document IDs of the words occurring before and
after 'OR' are retrieved from the inverted index and combined.

• If it matches 'NOT', the document IDs of the word occurring after 'NOT'
are retrieved from the inverted index, and the final output is all
document IDs except these.
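
The three cases above map directly onto set operations. The sketch below is
a simplified Python illustration that handles only a single operator per
query; the `inverted_index` layout (term to list of document IDs) and the
function names are assumptions, not the system's actual interfaces.

```python
def postings(term, inverted_index):
    """Return the set of document IDs for one term (empty if unknown)."""
    return set(inverted_index.get(term, ()))


def evaluate(tokens, inverted_index, all_doc_ids):
    """Evaluate 'a AND b', 'a OR b', 'NOT a' or a single exact term."""
    if len(tokens) == 3 and tokens[1] == "AND":
        # Intersection of the two posting lists.
        return postings(tokens[0], inverted_index) & postings(tokens[2], inverted_index)
    if len(tokens) == 3 and tokens[1] == "OR":
        # Union of the two posting lists.
        return postings(tokens[0], inverted_index) | postings(tokens[2], inverted_index)
    if len(tokens) == 2 and tokens[0] == "NOT":
        # Every document except those containing the term.
        return set(all_doc_ids) - postings(tokens[1], inverted_index)
    return postings(tokens[0], inverted_index)
```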

If the user opts for query expansion, the tokenized query term is looked up
in WordNet for its synonyms, hyponyms and coordinate terms. Then the
document IDs of all news articles containing any of those terms are
retrieved from the database.
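
A minimal Python sketch of this expansion step follows. The `RELATED_TERMS`
table is a hypothetical stand-in for the WordNet lookup of synonyms,
hyponyms and coordinate terms; its single entry reuses the 'kill' example
from the evaluation in Section 5.2.

```python
# Hypothetical stand-in for a WordNet lookup.
RELATED_TERMS = {"kill": ["out", "eliminate"]}


def expand_query(term, related=RELATED_TERMS):
    """Return the original term followed by its related terms."""
    return [term] + related.get(term, [])


def retrieve_expanded(term, inverted_index):
    """Union of the document IDs for the term and all of its expansions."""
    doc_ids = set()
    for expanded_term in expand_query(term):
        doc_ids |= set(inverted_index.get(expanded_term, ()))
    return doc_ids
```

Because the result is a union, expansion can only add documents, which is
why it widens recall at the risk of lower precision.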

If the user opts for wildcard character matching, the untokenized query
term can be used directly for retrieval. For example, a query term like
'hel?' or 'go*' can be used directly to retrieve the document IDs. After the
document IDs are obtained, their Term Frequency-Inverse Document Frequency
(TFIDF) score is calculated. The TFIDF score is given by (2),

$$\mathrm{TFIDF}_i = tf_i \times \log \frac{N}{n_i} \qquad (2)$$

where $tf_i$ is the frequency of the word $w_i$ in the given news article,
$N$ is the total number of articles in the news database and $n_i$ is the
number of news articles that contain the word $w_i$.

The news articles are ranked based on this score. Up to this point, the
process is the same for a summary query and for a query for all relevant
news articles.
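
The ranking by equation (2) can be sketched in Python as follows, for a
single query term. The `term_counts` layout and the function names are
assumptions for illustration, and the natural logarithm is used because the
text does not state the base.

```python
import math


def tfidf(tf, n_total, n_containing):
    """Equation (2): term frequency times log(N / n_i)."""
    return tf * math.log(n_total / n_containing)


def rank(term, doc_ids, term_counts, n_total):
    """Return doc_ids ordered by descending TFIDF score for one term.

    term_counts: dict of doc id -> {term: frequency in that document}.
    """
    n_containing = sum(
        1 for counts in term_counts.values() if counts.get(term, 0) > 0
    )
    if n_containing == 0:
        return list(doc_ids)  # term absent everywhere: nothing to rank by
    scores = {
        doc_id: tfidf(term_counts[doc_id].get(term, 0), n_total, n_containing)
        for doc_id in doc_ids
    }
    return sorted(doc_ids, key=lambda doc_id: scores[doc_id], reverse=True)
```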
Figure 5: Steps involved in query processing and expanding in each node



After the ranking is done,

• if the input query was for all relevant news articles, they are ordered
by TFIDF score, and the corresponding news articles are retrieved one by
one and written to an XML file in the same order. An example output XML
file is shown in Figure 6.

• if the input query was for a summary, the relevant news articles are
fetched and the steps shown in Figure 7 are followed to produce a summary
as an XML output file.
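
Writing the ranked articles to XML in the layout of Figure 6 can be sketched
with Python's standard `xml.etree.ElementTree` module. The input structure
(a list of dictionaries, best-ranked first) and the function name are
assumptions; only the element and attribute names are taken from the
example file.

```python
import xml.etree.ElementTree as ET


def build_news_xml(ranked_articles):
    """Serialize ranked articles as a <news> document.

    ranked_articles: list of dicts with 'keywords' (list of str),
    'headline', 'date' and 'text', ordered best-ranked first.
    The 'id' attribute records the rank position, starting at 1.
    """
    root = ET.Element("news")
    for position, article in enumerate(ranked_articles, start=1):
        node = ET.SubElement(root, "article", {
            "id": str(position),
            "keywords": ",".join(article["keywords"]),
            "headline": article["headline"],
            "date": article["date"],
        })
        node.text = article["text"]
    return ET.tostring(root, encoding="unicode")
```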

4.5. 3D News Visualizer

The architecture of the news retrieval and visualization modules is shown
in Figure 8. In the 3D News Visualizer, the XML file created by the previous
module is fed to the XML parser, which then parses the tags contained
<?xml version='1.0' encoding='ISO-8859-1'?>
<news>

<article id="1" keywords="oil,cargo,ship" headline="'Massive' oil spill clean-up underway in NZ" date="Thu, 13 - Oct - 2011">

(CNN) -- A major clean-up operation is underway along the north
coast of New Zealand's North Island as debris and oil leaking from a
cargo ship that ran aground on a reef wash ashore, officials say...
CNN's Karen Smith and Laura Smith-Spark contributed to this
report.
</article>
</news>



Figure 6: An example output XML file 
Figure 7: Steps involved in summary generation




in the XML document to get all the news articles iteratively and supplies
them to the news dispenser. Then, the 3D space generator creates a 3D screen
space with a camera focusing on the front view.

The news dispenser dynamically adds the news articles to various points in
3D space based on the previously ranked order. In other words, the z-axis
coordinate, or depth value, of each news article is determined by its order,
that is, by the 'id' attribute of its 'article' tag in the XML. The
'keywords' attribute of each 'article' tag is used to highlight the query
terms previously given by the user as input to the retrieval system.
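
The parsing and depth assignment can be illustrated with the following
Python sketch. It is not the ActionScript implementation: the `DEPTH_STEP`
spacing and the linear mapping from 'id' to z-coordinate are hypothetical
choices, since the text only states that depth is determined by the ranked
order.

```python
import xml.etree.ElementTree as ET

# Hypothetical spacing between consecutive articles along the z-axis.
DEPTH_STEP = 100.0


def place_articles(xml_text):
    """Parse the <news> XML and attach a depth value to each article."""
    root = ET.fromstring(xml_text)
    placed = []
    for node in root.iter("article"):
        rank = int(node.get("id").strip())
        placed.append({
            "id": rank,
            "headline": node.get("headline"),
            "keywords": node.get("keywords").split(","),  # terms to highlight
            "z": rank * DEPTH_STEP,  # better-ranked articles sit nearer
        })
    return placed
```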
Figure 8: Architecture of retrieval and visualization modules



There are two views associated with each news article in 3D screen space:
Title view and Detailed view. The default view of the system is the Title
view. In Title view, only the news article's ID, title and date of
publishing are visible. In Detailed view, the news article's ID, title, date
of publishing and the full news story, as text with highlighted keywords,
are displayed.

The user-interface of 3D News Visualizer is implemented using Adobe 
Flash and all the necessary event handling is done using ActionScript 3.0. 
3D News Visualizer has the following interactive features: Panning, Zooming 
and Selecting/Deselecting. 

Panning is performed with the arrow keys on the keyboard. To pan left, the
user presses the left arrow key; to pan right, the right arrow key.
Similarly, the up and down arrow keys pan towards the top and bottom. In
Detailed view, the user can also pan by moving the mouse pointer in the
desired direction.

Zooming is the increase or decrease of the depth values and sizes of all
the news articles, to simulate the effect of flying through 3D space. It is
performed with the mouse scroll wheel: scrolling up zooms in and scrolling
down zooms out. It can also be performed by double-clicking the left mouse
button over a title far away in 3D space.

Selecting/Deselecting is done by a left-button mouse click over the title
of a news article. It toggles between Title view and Detailed view. Every
time the user clicks, the corresponding news article's ID is displayed at
the bottom left of the screen. The total number of retrieved news articles
is displayed at the bottom right of the screen.

5. Performance Evaluation 

5.1. Our Parallel Crawler vs Other Crawlers 

In this evaluation, the efficiency and performance of our parallel crawler
are compared with those of other parallel crawlers, such as UbiCrawler and
Mercator. Our parallel crawler was executed in an eight-node setup. In each
trial, the number of RSS feeds is increased. Each RSS feed contains about
100-120 links (URLs). The results are shown in Figure 9.




Figure 9: Execution Times of Our Parallel Crawler vs Other Crawlers
(x-axis: Number of URLs)
The results show that, for any given number of RSS feeds, our parallel
crawler is approximately twice as fast as the other crawlers. This is due to
its use of the MapReduce approach coupled with efficient usage of Cloud
resources. The parallel crawler splits the input URLs evenly and assigns
them to the nodes to crawl simultaneously. If one URL takes more than a set
time limit to be fully crawled, a new Hadoop node is instantiated and the
crawling of the next URL begins. This continues until the rest of the URLs
are crawled. In the other crawlers, by contrast, crawling is limited by the
number of nodes present at the start of the crawling process, and dynamic
instantiation of nodes is not done efficiently. The crawler described in
this paper can handle a fully variable workload with ease while consuming
only as much computing resource as needed, at full utilization.

5.2. Normal Query vs Expanded Query 

For this evaluation, the total number of news documents was increased in
constant steps by adding news documents on random topics, and each time the
same query term was input to the system. In each trial, the query term is
evaluated both as a normal query and as an expanded query.

For example, if the input query term is 'kill', it is first taken as a
normal query, and an exact match against the news articles in the
distributed database is performed to retrieve the resulting news articles.
Then the same query term 'kill' is expanded, which gives additional terms
like 'out' and 'eliminate'. This expanded query is then executed to retrieve
the resulting news articles.

The precision of both the normal query and the expanded query was
calculated for each increase in the total number of documents. The precision
graphs are shown in Figure 10 and Figure 11.
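
For reference, precision is the fraction of retrieved articles that are
relevant. The following Python sketch uses that standard definition; the
function name and the convention of returning 0.0 for an empty result set
are assumptions, and the relevance judgments used in the evaluation are not
given in the text.

```python
def precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved documents that are relevant."""
    retrieved = set(retrieved_ids)
    if not retrieved:
        return 0.0  # convention chosen here for an empty result set
    return len(retrieved & set(relevant_ids)) / len(retrieved)
```

An expanded query that returns twice as many articles without adding any
relevant ones would therefore halve the precision.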

The results indicate that, as the number of documents increases, the
precision of the expanded query decreases. This is because, at first, when
the number of documents pertaining to each topic is small, fewer news
articles match the terms 'kill', 'out' and 'eliminate': the news articles
may not contain these terms at all, or only topics like 'Crimes' or 'Law
Enforcement' are present in the database.

Later, if other topics like 'Cricket', 'Tennis' or even 'Politics' are
added, these topics may also contain the same keywords as the expanded
query, but not in the same context. For instance, in the topic 'Cricket',
the keyword 'out' may occur with an entirely different meaning. This makes
those news articles irrelevant, which in turn causes the rapid fall in
precision. There may be a slight increase in precision if, in some trial,
new news documents on topics like 'Crime' are added again.

Figure 10: Precision Graph for Normal Queries
Figure 11: Precision Graph for Expanded Queries
In the case of the normal query, the precision values did not show any
drastic variation. This is because matching is done only on the word 'kill'
and not on the others, which in turn returns news in more or less the same
context. The word 'kill' does not usually occur in different topics in
different contexts. It occurs in topics like 'Politics' or 'Tennis' only if
the news is about a political leader or a tennis player actually being
killed by someone or killing someone else. Sometimes, however, these
retrieved news articles may be irrelevant, for example if the user is
looking for killings related to humans alone and not animals or birds.

From this, we can infer that normal querying is the best way to get highly
relevant news articles, or to perform known-item or navigational search.
Expanded querying can be used to get a wide spectrum of news articles, to
retrieve possibly linked news stories, or when users are not sure what they
are looking for.

5.3. Comparisons of Retrieval done using Original News Index, Pre-processed 
News Index and without using Lucene-based Index 

For this evaluation, four different trials were done. The first trial was
for exact query term matching and the second for retrieval involving
AND/OR/NOT evaluation. The third trial was for proximity search and
retrieval and, finally, the fourth was for retrieval involving wildcard
character evaluation. The results are shown in Figure 12.



Figure 12: Retrieval Times for Various Methods of Retrieval (legend:
Without Index, Original News Index, Pre-processed News Index; x-axis:
Trials)
The results show that, for the first, second and fourth trials, the
retrieval time is longer when retrieval is performed using the original
index (the index created without pre-processing the news articles). This is
because the number of comparisons is large, as the index includes both stop
words and non-stemmed words.

In the third trial, however, when proximity search is performed using the
original index, the retrieval time is shorter than with the pre-processed
index. This is because the positional details of all the words in the news
articles are stored in the original index, whereas they are lost when
pre-processing is done before indexing the news articles.
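
The role of positional details in proximity search can be illustrated with
the Python sketch below. It is a simplified model rather than Lucene's
actual matching logic: distance is measured as the difference between word
positions, and the function names are assumptions.

```python
def build_positions(text):
    """Map each word to the list of positions at which it occurs."""
    positions = {}
    for position, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(position)
    return positions


def within_distance(positions, term_a, term_b, max_gap):
    """True if some occurrence of term_a is within max_gap words of term_b."""
    return any(
        abs(a - b) <= max_gap
        for a in positions.get(term_a, [])
        for b in positions.get(term_b, [])
    )
```

If stop words are removed before positions are recorded, the recorded
positions no longer reflect the original word distances, which is why the
original index is the natural choice for proximity queries.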

It is clear that retrieval without a Lucene index takes very long. Another
advantage of a Lucene-based index is that it can be loaded into a database,
a filesystem or even the RAM of the system, which makes it flexible and
efficient. Thus, retrieval using a Lucene-based index yields faster and more
efficient results for the input queries.

6. Conclusion 

The next-generation news retrieval system presented here addresses most of
the shortcomings of today's news retrieval systems in terms of scalability,
reliability and fault tolerance. The parallel news crawler used in this
system is faster than traditional crawlers, as it is designed using the
MapReduce programming model and is powered by Cloud technology.

Processing news articles in parallel in MapReduce fashion improves
performance and also makes the news retrieval system more robust. The
distributed database provides the large storage space needed for the
ever-growing number of news articles. The system is fault tolerant, as data
is replicated automatically after every write operation performed on the
distributed database.

Lucene-based indexing can reduce the retrieval time by half. The query
expansion feature and the extractive summarization presented here make
retrieval more efficient. The 3D visualization of the news articles makes
this news retrieval system more interactive and engaging. Processing and
retrieval of news in the form of image, video or audio, in addition to
textual news articles, and automatic backup tasks can be considered as
future enhancements.